Solution-based methods and materials for sequence analysis by hybridization

ABSTRACT

Novel solution-based methods and materials, including apparatus, for sequence analysis by hybridization are provided.

[0001] This application is a continuation-in-part of U.S. application Ser. No. 09/277,383 filed Mar. 25, 1999, incorporated herein by reference.

FIELD OF THE INVENTION

[0002] The invention relates generally to novel methods and materials for nucleic acid sequence analysis by hybridization, in which the hybridization reaction occurs in a solution environment.

BACKGROUND

[0003] The rate of determining the sequence of the four nucleotides in nucleic acid samples is a major technical obstacle for further advancement of molecular biology, medicine, and biotechnology. Nucleic acid sequencing methods which involve separation of nucleic acid molecules in a gel have been in use since 1978. The other proven method for sequencing nucleic acids is sequencing by hybridization (SBH).

[0004] The traditional method of determining a sequence of nucleotides (i.e., the order of the A, G, C and T nucleotides in a sample) is performed by preparing a mixture of randomly terminated, differentially labeled nucleic acid fragments by degradation at specific nucleotides, or by dideoxy chain termination of replicating strands. Resulting nucleic acid fragments in the range of 1 to 500 bp are then separated on a gel to produce a ladder of bands wherein the adjacent samples differ in length by one nucleotide.

[0005] SBH does not require single base resolution in separation, degradation, synthesis or imaging of a nucleic acid molecule. Using mismatch discriminative hybridization of short oligonucleotides K bases in length, lists of constituent K-mer oligonucleotides may be determined for target nucleic acid. Sequence for the target nucleic acid may be assembled by uniquely overlapping scored oligonucleotides.

[0006] There are several approaches available to achieve sequencing by hybridization. In a process called SBH Format 1, nucleic acid samples are arrayed, and labeled probes are hybridized with the samples. Replica membranes with the same sets of sample nucleic acids may be used for parallel scoring of several probes and/or probes may be multiplexed. Nucleic acid samples may be arrayed and hybridized on nylon membranes or other suitable supports. Each membrane array may be reused many times. Format 1 is especially efficient for batch processing large numbers of samples.

[0007] In SBH Format 2, probes are arrayed at locations on a substrate which correspond to their respective sequences, and a labeled nucleic acid sample fragment is hybridized to the arrayed probes. In this case, sequence information about a fragment may be determined in a simultaneous hybridization reaction with all of the arrayed probes. For sequencing other nucleic acid fragments, the same oligonucleotide array may be reused. The arrays may be produced by spotting or by in situ synthesis of probes.

[0008] In Format 3 SBH, two sets of probes are used. In one embodiment, a set may be in the form of arrays of probes with known positions, and another, labeled set may be stored in multiwell plates. In this case, target nucleic acid need not be labeled. Target nucleic acid and one or more labeled probes are added to the arrayed sets of probes. If one attached probe and one labeled probe both hybridize contiguously on the target nucleic acid, they are covalently ligated, producing a detected sequence equal to the sum of the length of the ligated probes. The process allows for sequencing long nucleic acid fragments, e.g. a complete bacterial genome, without nucleic acid subcloning in smaller pieces.

[0009] However, to sequence long nucleic acids unambiguously, SBH involves the use of long probes. As the length of the probes increases, so does the number of probes required to generate sequence information. Each 2-fold increase in length of the target requires a one-base increase in the length of the probe, resulting in a four-fold increase in the number of probes required (the complete set of all possible sequences of probes of length k contains 4^k probes). For example, sequencing 100 bases of DNA requires 16,384 7-mers; sequencing 200 bases requires 65,536 8-mers; 400 bases, 262,144 9-mers; 800 bases, 1,048,576 10-mers; 1600 bases, 4,194,304 11-mers; 3200 bases, 16,777,216 12-mers; 6400 bases, 67,108,864 13-mers; and 12,800 bases requires 268,435,456 14-mers.

[0010] Because a limited number of probes can be scored in each array-based hybridization reaction, use of an extremely large number of probes requires carrying out multiple hybridization reactions.

[0011] An improvement in SBH that increases efficiency and reduces the number of hybridization reactions would greatly enhance the practical ability to sequence long pieces of polynucleotides de novo. Such an improvement would, of course, also enhance resequencing and other applications of SBH. Thus, there remains a need for additional and improved methods and materials for performing sequence analysis by hybridization.

SUMMARY OF THE INVENTION

[0012] The present invention provides novel methods and materials, including apparatus and kits, for performing sequence analysis by hybridization (referred to herein as “SBH”). According to the present invention, the efficiency, sensitivity and accuracy of these methods are improved by performing the entire hybridization step in solution, preferably coupled with single probe molecule detection. The methods and materials of the present invention advantageously allow for easier preparation of probes without attaching them to a fixed support, allow the use of larger numbers and different types of probes, improve hybridization and enzymatic kinetics relative to solid-phase hybridization (when either target or probe(s) are bound to a solid support), and allow for use of a different range of detection devices.

[0013] In one aspect, the invention provides methods of detecting a sequence of a target nucleic acid, comprising: (a) contacting a target nucleic acid with one or more mixtures of a plurality of oligonucleotide probe molecules of predetermined length and predetermined sequence, wherein each probe molecule comprises an information region and at least two probe molecules have different information regions, under conditions which produce, on average, more probe:target hybridization with probe molecules which are perfectly complementary to the target nucleic acid in the information region of the probe molecules than with probe molecules which are mismatched in the information region, wherein the target nucleic acid is not attached to a support, and wherein the probe molecules are not attached to a support; (b) detecting probe molecules that hybridize with the target nucleic acid, using a reader capable of detecting an individual probe molecule; and (c) detecting a sequence of the target nucleic acid by overlapping sequences of the information regions of at least two of the probe molecules contacted with the target in step (a). Methods of the invention are carried out wherein at least two mixtures are contacted simultaneously, or alternatively wherein at least two mixtures are contacted sequentially. Methods of the invention include those wherein at least about 10 probe molecules distinct in their information regions, at least about 100 probe molecules distinct in their information regions, at least about 1,000 probe molecules distinct in their information regions, or at least about 10,000 probe molecules distinct in their information regions are used. In one aspect, methods of the invention include probe molecules that comprise modified bases.

[0014] Multiple probe molecules of the invention may also be associated with identification tags, and in one aspect, multiple probe molecules each have two identification tags. In one aspect, methods may include multiple probe molecules having the same information region which are each associated with the same identification tag. In another aspect, at least two probe molecules having different information regions are associated with different identification tags.

[0015] Methods of the invention include those wherein the probe molecules are divided into pools, wherein each pool comprises at least two probe molecules having different information regions, and all probe molecules within each pool are associated with the same identification tag which is unique to the pool. In one aspect, at least one identification tag is a bar code. Methods are provided wherein the bar code is based on a property selected from the group consisting of size, shape, electrical properties, magnetic properties, optical properties, and chemical properties. Alternatively, the identification tag is a DNA bar code comprising modified bases, a molecular bar code, or a nanoparticle bar code. In one aspect, the bar code comprises elements of varying length, each element comprising a preset number of unit tags. The preset number may vary, e.g., may be 1, 2, 3, 4, 5 or more depending on the desired number of combinations and the type of unit tags.

[0016] In another aspect, methods of the invention include a target nucleic acid which is associated with a separator tag. Alternatively, methods are provided wherein the probe molecules are associated with separator tags.

[0017] The invention further provides methods wherein before the detection step (b) described above, probe molecules that hybridize to the target nucleic acid are separated from probe molecules that do not hybridize to the target nucleic acid. In one aspect, probe molecules that do not hybridize to the target nucleic acid are eliminated by enzymatic digestion.

[0018] The invention also provides methods wherein step (b) further comprises counting the number of times probe molecules having the same information region are detected. In one aspect, the methods of the invention include a reader comprising a nanopore channel which is used to detect probe molecules in step (b). Alternatively, methods are provided wherein sensing of electrical responses within or around the nanopore channel is used to detect probe molecules in step (b). In one aspect, the reader detects molecular bar codes in step (b).

[0019] The invention further provides methods wherein the probe molecules are associated with one or more tags that allow identification of 5′/3′ orientation of probe molecules during detection step (b). In another embodiment of the methods of the invention, the sequence of the probe molecule(s) is detected in step (b). In one aspect, methods are provided wherein at least two probe molecules are associated with identification tags and the identification tags are also detected in step (b).

[0020] The invention further provides methods of sequencing a target nucleic acid, comprising: (a) contacting a target nucleic acid with one or more mixtures of a plurality of oligonucleotide probe molecules of predetermined length and predetermined sequence, wherein each probe molecule comprises an information region and at least two probe molecules have different information regions, under conditions which produce, on average, more probe:target hybridization with probe molecules which are perfectly complementary to the target nucleic acid in the information region of the probe molecules than with probe molecules which are mismatched in the information region, wherein the target nucleic acid is not attached to a support, and wherein the probe molecules are not attached to a support; (b) covalently joining probe molecules that form contiguous probe:target hybrids that are perfectly complementary to the target in the information region of the probe molecules; and (c) detecting covalently joined probe molecules, using a reader capable of detecting an individual probe molecule. In another aspect, methods of the invention further comprise the step of: (d) detecting a sequence of the target nucleic acid by overlapping at least two sequences generated by combining sequences of the information region of two probe molecules contacted with target nucleic acid in step (a). As used herein, “combining sequences of the information region of two probe molecules” means contiguously combining sequences in proper 5′-3′ orientation. In another embodiment, methods are provided wherein before detection step (c), covalently joined probe molecules are separated from probe molecules that have not been covalently joined.

[0021] Also provided are methods of the invention wherein at least one nucleotide is added to the end of one or more probe molecules that hybridize to target nucleic acid using a polymerase or active fragment thereof. In one aspect, the probe molecules are contacted with a mixture of four different uniquely labeled nucleotides.

[0022] Methods are provided wherein target nucleic acids comprising an entire human genome are contacted with probe molecules. Alternatively, methods are provided wherein a single nucleotide polymorphism is detected.

[0023] The invention further provides kits comprising a mixture of probe molecules, wherein about 100 or more probe molecules each have distinct information regions, wherein two or more of the sequences of said distinct information regions within the mixture overlap. In one aspect, kits are provided wherein about 10⁵ or fewer probe molecules each have the same information region. In another embodiment, kits are provided wherein about 10⁴ or fewer probe molecules each have the same information region. Alternatively, kits of the invention include those wherein each information region is represented by 10⁴ or more probe molecules having the same information region. Also provided are kits wherein at least two probe molecules having the same information region have the same identification tag.

[0024] In one aspect, kits are provided comprising a set of mixtures of probe molecules, wherein about 100 or more probe molecules each have distinct information regions, wherein two or more of the sequences of said distinct information regions within the set overlap. In one aspect, kits of the invention are provided wherein about 10⁵ or fewer probe molecules each have the same information region. In another aspect, kits are provided wherein at least two probe molecules having different information regions are in the same pool and have the same identification tag. Kits are also provided wherein about 5000 or more probe molecules each have the same information region.

[0025] The invention further provides tags which are bar codes comprising an alternating arrangement of elements of varying detectable properties, wherein consecutive elements have a difference in at least one of the detectable properties. In one aspect, the elements in the tag comprise multiple unit tags of varying detectable properties and said elements vary in length.

[0026] The present invention provides methods for analyzing the sequence of a target nucleic acid, comprising the steps of (a) contacting a target nucleic acid with a mixture of a plurality of oligonucleotide probes (which may include a plurality of probe molecules) of predetermined length and predetermined sequence, wherein each probe molecule comprises an information region, under conditions which discriminate between probe:target hybrids that are perfectly complementary in the information region of the probe and probe:target hybrids that are mismatched in the information region of the probe (i.e., under conditions which produce, on average, more probe:target hybridization with probes which are perfectly complementary to the target nucleic acid in the information region of the probes than with probes which are mismatched in the information regions), wherein the target nucleic acid is not attached to a fixed support, and wherein the probes are not attached to a fixed support; (b) detecting a subset of probes that hybridize with the target nucleic acid, preferably using a reader capable of detecting an individual probe molecule; and (c) determining the sequence of the target nucleic acid from two or more of the probes detected in step (b).

[0027] Depending on the conditions of hybridization and the use of pooling methods, described in further detail below, step (b) may include detection of more probes than the subset that hybridizes with the target nucleic acid. However, the SBH process and algorithms are very robust and can handle a large number of false positive probes, as discussed more fully in U.S. Provisional Application Serial No. 60/115,284 filed Jan. 6, 1999, and related co-owned, co-pending U.S. application Ser. No. 09/479,608 filed Jan. 6, 2000, both of which are incorporated herein by reference.

[0028] Determining the sequence in step (c) can be done, for example, by actually overlapping the sequences of some (e.g., two or more, three or more, or four or more) or all of the detected probes for the target nucleic acid, or by comparing the detected set of probes for the target nucleic acid (which may be correlated with the identity of a nucleic acid sample to serve as a signature for identifying the nucleic acid sample) to the detected set for another target nucleic acid.

[0029] Optionally, between steps (a) and (b), a step of separating probe:target hybrids that are perfectly complementary in the information region of the probe from probe:target hybrids that are mismatched in the information region of the probe is carried out. Alternatively, between steps (a) and (b), a step of covalently joining probes that form contiguous probe:target hybrids that are perfectly complementary to the target in the information region of the probes is carried out, and in step (b) a subset of covalently joined probes is detected.

[0030] The probes may be associated with identification tags. Each of the probes may be associated with a unique identification tag; alternatively, the probes may be divided into informational pools, and all probes within each informational pool are associated with an identification tag unique to the informational pool. A preferred identification tag is a DNA bar code. The target nucleic acid or the probes may also be associated with one or more separator tags that aid separation of the probe:target hybrids from the unhybridized nucleic acids.

[0031] The number of probes used in the hybridization step may be at least about 10, at least about 100, at least about 1000, at least about 10⁴, at least about 10⁵, at least about 10⁶, or at least about 10⁷ different probes (meaning the number of probe sequences distinct in their information regions), and may potentially range up to about 10¹¹ different probes or even more. According to this method, the mixture of probe molecules comprises at least two and preferably many more probe molecules each having different information regions.

[0032] In detection step (b), the positive probes may be detected using a reader comprising a nanopore channel. Detection may take place via, e.g., sensing of electrical responses within or around the nanopore channel as the probe molecule passes through or over the pore. In preferred embodiments of the method, a reader comprising a nanopore channel detects a DNA bar code associated with each probe, or detects the sequence of the probe itself. Alternatively, the positive probes may be detected using any suitable reader known in the art that is capable of detecting single probe molecules.

[0033] Another aspect of the invention provides an apparatus comprising means for carrying out the hybridization step and means for carrying out the detecting step, as described above. Such an apparatus preferably comprises a reader comprising a nanopore channel.

[0034] A further aspect of the invention provides sets of probes in the form of one or more kits, each set comprising a mixture of different probes. The set of probes may comprise at least about 10, 100, 1000, 10⁴, 10⁵, 10⁶, 10⁷, 10⁸, 10⁹, or 10¹¹ different probes (meaning the number of probe sequences distinct in their information region). Preferably each information region is represented by about 10⁴ or more probe molecules (which may include degenerate ends). Each probe in the set may be associated with one or more identification tags and optionally one or more separator tags. Each probe in the set need not be associated with a unique identification tag, particularly if the kit is intended for use with pooling methods in SBH.

[0035] The improved SBH efficiency provided by the present invention is particularly advantageous for sequencing and resequencing applications that require an extremely large number of probes. Examples of applications that require very large numbers of probes are: (1) sequencing or resequencing of the entire human genome and other complex genomes, (2) sequencing or resequencing of total mRNA or cDNA in a human or other complex cell, (3) genotyping thousands or millions of single nucleotide polymorphisms in individual human genomes, and (4) de novo sequencing of thousands of bases. Potentially, all of the probes that would be needed to perform these types of sequence analysis could be used in a single solution-based hybridization reaction with the target polynucleotide sequence.

[0036] The methods and materials of the invention also may be useful for carrying out DNA computing, described, for example, in Ouyang et al., Science, 278:446-9 (1997), and Guarnieri et al., Science, 273:220-3 (1996).

[0037] Numerous additional aspects and advantages of the invention will become apparent to those skilled in the art upon consideration of the following detailed description of the invention which describes presently preferred embodiments thereof.

BRIEF DESCRIPTION OF THE DRAWINGS

[0038] FIG. 1 shows an exemplary bar code based on magnetic, optical or conductivity properties.

DETAILED DESCRIPTION OF THE INVENTION

[0039] The three major steps of SBH are biochemical hybridization of probes to target polynucleotide, detection of positive results (that are expected to include probes that are fully complementary to the target nucleic acid), and informational sequence analysis or assembly from the results, which may involve overlapping sequences of the information regions of the probes.

[0040] The present invention provides novel methods and materials, including apparatus, for performing SBH wherein the entire hybridization step is performed in a solution environment. The present invention provides improvements in efficiency, sensitivity and accuracy of SBH in comparison to conventional SBH methods, and can be used for any types of sequence analysis that conventional SBH methods are useful for. The procedure has many applications in nucleic acid diagnostics, forensics, and gene mapping. It also may be used to discover mutations and polymorphisms including single nucleotide polymorphisms (SNP) in a selected portion of a gene, the full gene, the entire genome, or a subset of the genome, to identify mutations responsible for genetic disorders and other traits, to verify the identity of nucleic acid fragments, to identify infectious agents, specific strains thereof, or mutants thereof (including viruses, bacteria, fungi, and parasites), to identify nucleic acid in samples for forensic purposes or for parental identification, to assess biodiversity and to produce many other types of data dependent on nucleic acid sequence. See, e.g., Examples 19 through 27 of Int'l Publication No. WO 98/31836 published Jul. 23, 1998 and WO 99/09217 published Feb. 28, 1999, both of which are incorporated herein by reference.

[0041] The SBH methods of the present invention differ from conventional SBH methods primarily because (1) the hybridization reaction is carried out entirely in solution (i.e., neither the probes nor the target nucleic acids are attached to a fixed support), (2) a mixture of a large number of probes can be hybridized to the target nucleic acid at the same time in solution, and (3) the detection of positive probes is carried out at the level of single molecules, rather than hundreds or thousands of molecules (e.g., if positive probes are detected using mass spectrometry, hybridization of perhaps a thousand probes would be required to generate a detectable signal, or alternatively if probes are radioactively labeled, hybridization of perhaps 10⁵ probes would be required to generate a detectable positive signal).

[0042] Conventional SBH is a well developed technology that may be practiced by a number of methods known to those skilled in the art. For example, variations of techniques related to sequencing by hybridization are described in the following documents, all of which are incorporated by reference herein: Drmanac et al., U.S. Pat. No. 5,202,231 (hereby incorporated by reference herein), issued Apr. 13, 1993; Drmanac et al., U.S. Pat. No. 5,525,464 (hereby incorporated by reference herein), issued Jun. 11, 1996; Drmanac, PCT Patent Appln. No. WO 95/09248 (hereby incorporated by reference); Drmanac et al., Genomics, 4, 114-128 (1989); Drmanac et al., Proceedings of the First Int'l. Conf. Electrophoresis Supercomputing Human Genome Cantor et al. eds, World Scientific Pub. Co., Singapore, 47-59 (1991); Drmanac et al., Science, 260, 1649-1652 (1993); Lehrach et al., Genome Analysis: Genetic and Physical Mapping, 1, 39-81 (1990), Cold Spring Harbor Laboratory Press; Drmanac et al., Nucl. Acids Res., 4691 (1986); Stevanovic et al., Gene, 79, 139 (1989); Panusku et al., Mol. Biol. Evol., 1, 607 (1990); Nizetic et al., Nucl. Acids Res., 19, 182 (1991); Drmanac et al., J. Biomol. Struct. Dyn., 5, 1085 (1991); Hoheisel et al., Mol. Gen., 4, 125-132 (1991); Strezoska et al., Proc. Nat'l. Acad. Sci. (USA), 88, 10089 (1991); Drmanac et al., Nucl. Acids Res., 19, 5839 (1991); and Drmanac et al., Int. J. Genome Res., 1, 59-79 (1992).

[0043] Conventional SBH approaches use arrays of target samples which are hybridized to labeled probes (Format 1), or arrays of probes which are hybridized to labeled target samples (Format 2), for efficient parallel scoring of multiple hybridization events. In one approach, either target samples or probes are attached to solid supports in the form of beads that serve to separate parallel hybridization reactions in the reading or detection step. Beads or other markers can be used as tags to identify probes. Mass spectrometry technology can also be used to distinguish probe species on the basis of their mass even when the probes are not tagged.

[0044] Format 1, 2 and 3 SBH methods are described in further detail below. In addition, a set of probes can be scored in the form of informative pools with minimal loss of information, as described in U.S. Provisional Application Serial No. 60/115,284 entitled “Enhanced Sequencing by Hybridization Using Informative Pools of Probes” filed Jan. 6, 1999, and related co-owned, co-pending U.S. application Ser. No. 09/479,608 filed Jan. 6, 2000, both of which are incorporated herein by reference. Other types of pools may be used.

[0045] In addition, pooling probes that cannot be distinguished in the reading step is possible, and probes with unique tags can be multiplexed in the same hybridization reaction. For example, probes that are difficult to distinguish during the reading step on the basis of their sequence alone are categorized in an informative pool, wherein the entire pool is scored as positive if any one of the probes is detected. As another example, probes can be grouped and tagged as follows. Probes that are difficult to distinguish during the reading step on the basis of their sequence alone are put in a group, and each distinct probe molecule (i.e., different within its information region) within a group is attached to an identification tag that is unique within that group, although the tag may be repeated in a different group. Alternatively, probes that are easy to distinguish on the basis of sequence alone can be put in a group, and each distinct probe molecule within the group is attached to a tag common to the group. Thus, probes having very similar-appearing sequences may be distinguished on the basis of their different identification tags, while probes having very different-appearing sequences could share the same identification tag and may be distinguished on the basis of sequence alone. In this way, multiple groups could be combined together and still allow distinct probe molecules to be distinguished from each other. In another embodiment, probes that have very different-appearing sequences need not be tagged or labeled at all but are discriminated on the basis of sequence.
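The informative-pool scoring rule described above (a pool is positive if any one of its probes is detected) can be illustrated with a minimal sketch. The function name score_pools, the tag labels and the probe sequences are hypothetical and chosen only for illustration; they are not part of the described methods.

```python
def score_pools(pools, detected_probes):
    """Score each pool as positive if any one of its member probes was detected.

    pools: dict mapping a pool identification tag to the probes sharing that tag.
    detected_probes: iterable of probe sequences reported by the reader.
    """
    detected = set(detected_probes)
    return {tag: any(probe in detected for probe in members)
            for tag, members in pools.items()}


# Hypothetical example: two pools of two 8-mer probes each.
pools = {
    "tag_A": ["ACGTACGT", "ACGTACGA"],
    "tag_B": ["TTTTGGGG", "CCCCAAAA"],
}
scores = score_pools(pools, ["ACGTACGA"])
print(scores)
```

In this sketch only the pool tag is scored, so which member probe produced the signal is not recorded, consistent with scoring the entire pool as a unit.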

[0046] Probes can be individually synthesized in separate reactions or in situ, or a combination of two much smaller sets of shorter probes may be used to score a much larger set of longer probes (1024 5-mers × 1024 5-mers = 1,048,576 10-mers).
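The arithmetic in the parenthetical can be checked with a short sketch. It enumerates the complete set of 5-mers and shows that every 10-mer corresponds to exactly one ordered pair of 5-mers. The helper names all_kmers and split_probe are assumptions introduced here for illustration.

```python
from itertools import product


def all_kmers(k):
    """Return the complete set of probes of length k over A, C, G, T."""
    return ["".join(bases) for bases in product("ACGT", repeat=k)]


def split_probe(probe, left_len):
    """Split a longer probe into the two shorter probes that jointly score it."""
    return probe[:left_len], probe[left_len:]


five_mers = all_kmers(5)
pair_count = len(five_mers) ** 2  # ordered (left, right) pairs of 5-mers

print(len(five_mers))             # size of each 5-mer set
print(pair_count)                 # number of 10-mers scored by the pairs
print(split_probe("ACGTACGTAC", 5))
```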

[0047] The length of the target sequence that can be analyzed or read (the “read length”) using SBH depends on the length of the probes and the availability of additional information. For example, when a known gene is sequenced for the purpose of detecting polymorphisms or mutations in individuals, the known reference sequence will aid the sequence analysis. If the hybridization information is used alone, without any additional information, the read length for 10-mer probes is about 800 bases. The read length approximately doubles for every additional base extension of the probe length (e.g., the read length for 11-mer probes is about 1600 bases).

[0048] As the length of the probes increases, so does the number of probes required to generate sequence information. Each one-base increase in the length of the probe (providing a two-fold increase in read length) requires a four-fold increase in the number of probes required because a complete set of probes of length k contains 4^k probes. For example, sequencing 100 bases of DNA requires 16,384 7-mers; sequencing 200 bases requires 65,536 8-mers; 400 bases, 262,144 9-mers; 800 bases, 1,048,576 10-mers; 1600 bases, 4,194,304 11-mers; 3200 bases, 16,777,216 12-mers; 6400 bases, 67,108,864 13-mers; and 12,800 bases requires 268,435,456 14-mers.
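The figures listed above follow from two rules stated in the text: a complete set of k-mers contains 4^k probes, and the approximate read length doubles with each added base. A minimal sketch that reproduces the list is given below; the function names and the choice of the 7-mer, 100-base figure as the reference point are assumptions made for illustration.

```python
def probe_set_size(k):
    """Number of probes in a complete set of k-mers over A, C, G, T (4^k)."""
    return 4 ** k


def approx_read_length(k, ref_k=7, ref_read=100):
    """Approximate read length in bases, doubling per added base.

    Anchored (as an assumption) to the figure of about 100 bases for 7-mers.
    """
    return ref_read * 2 ** (k - ref_k)


rows = [(k, probe_set_size(k), approx_read_length(k)) for k in range(7, 15)]
for k, n_probes, read_len in rows:
    print(f"{k}-mers: {n_probes:,} probes, about {read_len:,} bases")
```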

[0049] In principle, the SBH method can read the entire human genome in a single hybridization reaction if an extremely large number of long-enough probes is used. Previously available methods involving array-based or support-based hybridization, however, were limited by the number of probes that could be attached to a single array or by the number of supports that could be used at once. The present invention allows an extremely large number of probes to be hybridized in a single solution-based reaction.

[0050] According to the present invention, a set of oligonucleotide probes consisting of a very large number of different probes is allowed to hybridize to an unknown target polynucleotide or a mixture of unknown target polynucleotides in solution. The subset of probes that hybridizes to the target polynucleotide is then selected from the negative probes and analyzed by a reader that detects and discriminates individual oligonucleotide molecules. For any given target sequence, most of the probes will be negative. For example, when sequencing a 200 bp target nucleic acid with 65,536 8-mers, only about 200 probes will be scored as positive (for a single-stranded polynucleotide, or 400 probes will be scored as positive for a double-stranded polynucleotide). Thus, the actual number of probes in the positive probe subset that must be detected is relatively small in comparison to the original set of probes.
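The "about 200" and "about 400" figures can be related to a simple count: a target of length L contains at most L − k + 1 distinct k-mers per strand, which is 193 for L = 200 and k = 8. The sketch below counts the distinct k-mers of a target, optionally including the reverse-complement strand. It assumes ideal hybridization with no false positives or negatives, and identifies each probe by the target-strand word it reads; all function names are hypothetical.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)[::-1]


def target_kmers(seq, k):
    """Set of distinct k-base words occurring in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def positive_probe_count(target, k, double_stranded=False):
    """Count probes expected to be positive under ideal discrimination."""
    words = target_kmers(target, k)
    if double_stranded:
        words |= target_kmers(reverse_complement(target), k)
    return len(words)


def max_positive_probes(length, k, double_stranded=False):
    """Upper bound on positive probes for a target of the given length."""
    per_strand = length - k + 1
    return 2 * per_strand if double_stranded else per_strand


print(max_positive_probes(200, 8))        # single-stranded bound
print(max_positive_probes(200, 8, True))  # double-stranded bound
```

Against the 65,536 possible 8-mers, either bound is well under one percent of the full probe set.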

[0051] The advantages provided by the present invention are: 1) simplified preparation of pools of probes and parallel scoring of an extremely large number of probes; 2) efficient hybridization and enzymatic kinetics in solution; and 3) single molecule sensitivity, allowing (a) DNA analysis without PCR or other amplification reactions, (b) high accuracy by performing stringent hybridization and enzymatic or chemical elimination of mismatches, and (c) scoring a very large number of probes (especially when informative pools are used) with no background that might otherwise be prohibitive if probes are conventionally labeled.

[0052] The methods provided by the invention include a method comprising the steps of: (a) contacting a target nucleic acid with a mixture (or multiple mixtures) of a plurality of probes of predetermined length and predetermined sequence, wherein each probe comprises an information region, under conditions which discriminate between probe:target hybrids that are perfectly complementary in the information region of the probe and probe:target hybrids that are mismatched in the information region of the probe (i.e., under conditions which produce, on average, more probe:target hybridization with probe molecules which are perfectly complementary to the target nucleic acid in the information region of the probes than with probe molecules which are mismatched in the information regions), optionally wherein neither the target nucleic acid nor the probes are attached to a solid support; (b) detecting a subset of probes that hybridize with the target nucleic acid, optionally including a step of separating out the positive subset of probes; and (c) determining the sequence of the target nucleic acid from two or more of the probes detected in step (b), e.g., by compiling the overlapping sequences of the detected probes, or by otherwise analyzing the detected subset of probes.
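Compiling overlapping sequences in step (c) can be illustrated with a deliberately simplified sketch. It assumes an error-free set of positive k-mer probes, a single-stranded target, and a target in which every (k − 1)-base overlap is unique, so that each extension step is unambiguous. It is not the assembly algorithm of the incorporated references, which also handle false positives and branch points; the function name assemble_by_overlap and the example sequence are hypothetical.

```python
def assemble_by_overlap(probes, k):
    """Assemble a sequence from k-mers (k >= 2) by unique (k-1)-base overlaps.

    Raises ValueError when the start is not unique or an extension branches.
    """
    probes = set(probes)
    suffixes = {p[1:] for p in probes}
    # The starting probe is the one whose leading k-1 bases are not the
    # trailing k-1 bases of any probe in the set.
    starts = [p for p in probes if p[:-1] not in suffixes]
    if len(starts) != 1:
        raise ValueError("no unique starting probe")
    sequence = starts[0]
    used = {sequence}
    while True:
        tail = sequence[-(k - 1):]
        candidates = [p for p in probes - used if p[:-1] == tail]
        if not candidates:
            return sequence
        if len(candidates) > 1:
            raise ValueError("branch point: overlap is not unique")
        sequence += candidates[0][-1]  # extend by the one new base
        used.add(candidates[0])


# Hypothetical 12-base target and its complete set of positive 4-mers.
target = "ATGCCGTAAGCT"
positive = {target[i:i + 4] for i in range(len(target) - 3)}
assembled = assemble_by_overlap(positive, 4)
print(assembled)
```

The explicit errors mark the cases where overlap information alone is insufficient, which is where longer probes or additional information such as a reference sequence would be needed.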

[0053] For example, the sequence of the target nucleic acid may be determined by comparing the “signature” set of detected probes (and optionally overlapping and assembling some or all of the sequences of this probe set) to that of the “signature” set of detected probes for another target nucleic acid. A variety of algorithms and other methods are known in the art for analyzing the data obtained from the detection step. See, e.g., Int'l Publication No. WO 98/31836 published Jul. 23, 1998 (especially Examples 11 through 17 and 28 through 29 of WO 98/31836) and WO 99/09217 published Feb. 28, 1999, both of which are incorporated herein by reference. Further methods for data analysis in SBH, including data analysis when pooling methods are used, and statistical analysis of overlapping probe sequences and scores that need not involve assignment of “positive” or “negative” scores (including continuous scoring methods, rescoring methods, and maximum likelihood determination methods), are described in U.S. Provisional Application Serial No. 60/115,284 filed Jan. 6, 1999, and related co-owned, co-pending U.S. application Ser. No. 09/479,608 filed Jan. 6, 2000, both of which are incorporated herein by reference.

[0054] The plurality of probes used in step (a) may be an extremely large number, for example, about 10⁴, 10⁵, 10⁶, 10⁷, 10⁸, 10⁹, or 10¹⁰ probes. The mixture of probe molecules comprises at least two, at least three, at least four, at least 10, at least 16, at least 32, or more probe molecules each having different information regions. In step (a) either the target nucleic acid or the probes may be attached to a solid support, but step (a) is preferably carried out when the target nucleic acid and all of the probes are in solution (i.e., not attached to a fixed support such as a membrane or array or bead). Probes are optionally attached to identification tags or separator tags as described below.

[0055] In step (b), the positive probes (probes that hybridize with the nucleic acid) may be detected while remaining in the hybridization reaction solution, or the positive probes may first be selected or separated out from the negative probes, optionally by use of associated separator tag(s), before being identified. Further detail is provided below in the section entitled “Selection of Positive Probes.”

[0056] The positive probes may be identified directly by reading their nucleotide sequence, or indirectly by detecting a unique identification tag (which may be a string of unit tags forming a composite identification tag) associated with a single probe or an identification tag associated with an informative pool of probes. Further detail on suitable readers capable of single probe molecule detection is provided below in the section entitled “Readers Used to Detect Positive Probes.”

[0057] In a preferred embodiment, probes are labeled with DNA or other bar codes and are grouped in informative pools. The positive probes are selected from the negative probes, and the positive probes are detected using a reader comprising a nanopore channel.

Target Polynucleotide

[0058] “Target nucleic acid” or “target polynucleotide” refers to the nucleic acid of interest, typically the nucleic acid that is sequenced in the SBH assay. Potential target polynucleotides include naturally occurring or artificially created DNA (e.g., genomic DNA and cDNA) and RNA (e.g., mRNA), including nucleic acids used as part of DNA computing. The target nucleic acid may be composed of ribonucleotides, deoxyribonucleotides or mixtures thereof. Typically, the target nucleic acid is a DNA. While the target nucleic acid can be double-stranded, it is preferably single-stranded. The “read length” of the target nucleic acid can be any number of nucleotides, depending on the length of the probes, but is typically on the order of 100, 200, 400, 800, 1600, 3200, 6400, or even more nucleotides in length, up to the entire human genome.

[0059] The target nucleic acid can be obtained from virtually any source and can be prepared using methods known in the art. For example, target nucleic acids can be isolated by PCR methodology, or by cloning into plasmids (for a convenient target nucleic acid fragment length of 500 to 5,000 base pairs), or by cloning into yeast or bacterial artificial chromosomes (for a convenient target nucleic acid fragment length of up to 100 kb).

[0060] Depending on the desired length for use in the SBH assay, the target nucleic acid may be sheared into fragments prior to use in an SBH assay. Fragmentation may be accomplished by nonspecific endonuclease digestion, restriction enzyme digestion (e.g., by Cvi JI), physical shearing (e.g., by ultrasound) or NaOH treatment. Fragments may be separated by size (e.g., by gel electrophoresis) to obtain the desired fragment length. Fragmentation of the target nucleic acid also may avoid hindrance to hybridization from secondary structure in the sample. The sizes of the target nucleic acid fragments used in the hybridization reaction optimally range in length from slightly longer than the probe length to twice the probe length, e.g., 10-100 or 10-40 bases.

[0061] Probes

[0062] “Probes” refers to relatively short pieces of nucleic acids, preferably DNA. Probes are preferably shorter than the target DNA by at least one base, and more preferably they are 25 bases or fewer in length, still more preferably 20 bases or fewer in length. Of course, the optimal length of a probe will depend on the length of the target nucleic acid being analyzed. For a target nucleic acid composed of about 100 or fewer bases, the probes are at least 7-mers; for a target of about 100-200 bases, the probes are at least 8-mers; for a target nucleic acid of about 200-400 bases, the probes are at least 9-mers; for a target nucleic acid of about 400-800 bases, the probes are at least 10-mers; for a target nucleic acid of about 800-1600 bases, the probes are at least 11-mers; for a target of about 1600-3200 bases, the probes are at least 12-mers; for a target of about 3200-6400 bases, the probes are at least 13-mers; and for a target of about 6400-12,800 bases, the probes are at least 14-mers. For every additional two-fold increase in the length of the target nucleic acid, the optimal probe length is one additional base. Those of skill in the art will recognize that for Format 3 SBH applications, the above-delineated probe lengths are post-ligation. Thus, as used throughout, specific probe lengths refer to the actual length of the probes for Format 1- and 2-like SBH applications and the lengths of ligated probes for Format 3 or Format 3-like SBH applications. Probes are normally single-stranded, although double-stranded probes may be used in some applications.
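
By way of illustration only, the relationship between target length and minimum probe length stated above can be written as a short Python function. The name `min_probe_length` is hypothetical, and the treatment of the boundaries is an assumption of this sketch: because the stated ranges share their end points (e.g., 200 bases appears in both "100-200" and "200-400"), the upper end point of each range is assigned to the shorter probe length.

```python
def min_probe_length(target_len: int) -> int:
    """Minimum probe length (K) for a target of `target_len` bases.

    Starts at 7-mers for targets of up to about 100 bases and adds one base
    for every two-fold increase in target length.
    """
    if target_len < 1:
        raise ValueError("target length must be positive")
    k, upper = 7, 100
    while target_len > upper:
        upper *= 2
        k += 1
    return k

for length in (100, 200, 400, 800, 1600, 3200, 6400, 12800):
    print(length, min_probe_length(length))
```

The loop reproduces the listed values (7-mers through 14-mers) and extends the same doubling rule to longer targets.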

[0063] Probes may be prepared using standard chemistry procedures known in the art. The length “K” of the probes described above refers to the length of the informational content (i.e., the information region or the informative region) of the probes, not necessarily the actual physical length of the probes. The probes used in SBH frequently contain degenerate ends [e.g., one to three non-specified (mixed A, T, C and G) or universal (e.g., M base or inosine) bases at the ends] that aid hybridization but do not contribute to the information content of the probes. Hybridization discrimination of mismatches in these degenerate probe mixtures refers only to the length of the informational content, not the full physical length. For example, SBH applications frequently use mixtures of probes of the formula NxByNz, wherein N represents any of the four bases and varies for the polynucleotides in a given mixture, B represents any of the four bases but is the same for each of the polynucleotides in a given mixture, and x, y, and z are all integers. In this formula, Nx and Nz represent the degenerate ends of the probe and By represents the information content of the probe (e.g., a uniquely arrayed probe in conventional SBH).

[0064] According to the present invention, a single information region may be represented by not only multiple probe molecules of exactly the same sequence but also multiple probe molecules that differ in sequence outside of the information region (e.g., degenerate ends).
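
By way of illustration only, a mixture of the formula NxByNz can be enumerated in Python as follows. The function name `degenerate_pool` and the example information region are assumptions of this sketch; universal bases such as inosine are not modeled.

```python
from itertools import product

def degenerate_pool(info_region: str, x: int, z: int) -> list[str]:
    """List every physical sequence N(x)-B(y)-N(z) sharing one information region."""
    bases = "ACGT"
    pool = []
    for left in product(bases, repeat=x):        # degenerate 5' end, Nx
        for right in product(bases, repeat=z):   # degenerate 3' end, Nz
            pool.append("".join(left) + info_region + "".join(right))
    return pool

pool = degenerate_pool("ACGTACGT", x=1, z=1)  # hypothetical 8-base information region
print(len(pool))   # 16 physical 10-mers, all with the same information content
print(pool[:3])
```

All 4^(x+z) molecules in the pool carry the same By information region, so they are scored as a single probe even though they differ in physical sequence.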

[0065] The probes may consist solely of naturally-occurring nucleotides and native phosphodiester backbones, or the probes may be modified or tagged to enhance specificity of detection. For example, the probes may be composed of one or more modified bases, such as 7-deazaguanosine, or one or more modified backbone interlinkages, such as a phosphorothioate. The only requirement is that the probes be able to hybridize to the target nucleic acid. A wide variety of modified bases and backbone interlinkages that can be used in conjunction with the present invention are known, and will be apparent to those of skill in the art. For example, modified bases that are about twice the size of a conventional nucleotide are known in the art. Modifications that increase or decrease the size of the units at each base position are expected to improve the ability of some readers to discriminate the unit at each base position.

[0066] Other variations include the use of oligonucleotides to increase specificity or efficiency, and cycling hybridizations to increase the hybridization signal, for example by performing a hybridization cycle under conditions (e.g., temperature) optimally selected for a first set of labeled probes followed by hybridization under conditions optimally selected for a second set of labeled probes. Shifts in reading frame may be determined by using mixtures (preferably mixtures of equimolar amounts) of probes ending in each of the four nucleotide bases A, T, C and G.

[0067] The oligonucleotide probes are preferably tagged with identification tags to enhance detection or discrimination and are optionally tagged with separator tags to aid selection, as described below in the section entitled “Selection of Positive Probes.” According to one embodiment, enough identification tags are used so that each probe of distinct sequence in the information region can be tagged with a unique identification tag and so that positive probes can be identified during the reading/detection step using these identification tags. Alternatively, when informative pools of probes are used, all the probes in each informative pool are tagged with the same identification tag that is unique to the informative pool but different from other informative pools.

[0068] Examples of suitable identification tags include nanobeads or nanoparticles, polymers or molecules, which may have different size, shape, electrical (e.g., conductivity), magnetic, optical (e.g., opacity), chemical or other properties matched with the appropriate types of readers. Appropriate readers are described below in the section entitled “Readers Used to Detect Positive Probes.” Preferably the identification tags are capable of being multiplexed, i.e., their properties can be varied (including by use in a bar code) to allow the preparation of multiple unique identification tags.

[0069] Nanoparticles can be of any size that is capable of detection according to the methods of the invention, but are preferably less than about 500 nm, more preferably about 1 to about 100 nm, and most preferably in the range of about 2 to about 20 nm. In one exemplary embodiment, nanoparticles can be fragments of carbon single-walled nanotubes or multi-walled nanotubes.

[0070] Nanotubes may have different diameters, between 1-2 nm for single-walled nanotubes and 2-25 nm for multi-walled nanotubes. Nanotubes may also be cut to different lengths. This provides a convenient system for making bar codes with elements of alternating diameter, for example from 9 elements (nanotubes of three different diameters, D1, D2 and D3, with each type cut at three different lengths, L1, L2 and L3). Each element may be modified at both ends with a linker that may be blocked and deblocked to allow coupling in a process analogous to DNA synthesis. There are many different alternating orders of coupling the elements of different diameter in a chain containing N elements. For example, for a chain of four elements the orders include D1-D2-D1-D2-Probe, D2-D1-D2-D1-Probe, D1-D3-D1-D3-Probe, D2-D3-D2-D3-Probe, D1-D2-D3-D1, and many others, for 3×2³ (24) orders in all; for each order, 3⁴ (81) different bar codes can be produced by varying the element lengths.

[0071] One exemplary embodiment of a label capable of single molecule detection is the use of plasmon-resonant particles (PRPs) as optical reporters, as described in Schultz et al., Proc. Nat'l Acad. Sci., 97:996-1001 (2000), incorporated herein by reference. PRPs are metallic nanoparticles, typically 40-100 nm in diameter, which scatter light elastically with remarkable efficiency because of a collective resonance of the conduction electrons in the metal (i.e., the surface plasmon resonance). The magnitude, peak wavelength, and spectral bandwidth of the plasmon resonance associated with a nanoparticle are dependent on the particle's size, shape, and material composition, as well as the local environment. By influencing these parameters during preparation, PRPs can be formed that have a scattering peak anywhere in the visible range of the spectrum. For spherical PRPs, both the peak scattering wavelength and scattering efficiency increase with larger radius, providing a means for producing differently colored labels. Populations of silver spheres, for example, can be reproducibly prepared for which the peak scattering wavelength is within a few nanometers of the targeted wavelength, by adjusting the final radius of the spheres during preparation. Because PRPs are so bright, yet nanosized, they can be used as indicators for single-molecule detection; that is, the presence of a bound PRP in a field of view can indicate a single binding event.

[0072] A string of consecutive or nonconsecutive unit tags can be linked together to form a “bar code,” e.g., a “molecular bar code” or a “nanoparticle bar code.” A set of distinct molecules or particles may even be linked or cross-linked in various combinations to form a type of three-dimensional bar code; for example, combinations of 8 very distinct molecules can form 512 3-element chains or 32,768 5-element chains. Preferably, however, bar codes are chains containing no branches.

[0073] Bar code combinations can also be made by further grouping the unit tags into elements of varying length that each comprise multiple unit tags. In one aspect, the invention provides a bar code tag comprising an alternate arrangement of elements of varying detectable properties. Preferably, bar codes are prepared such that consecutive elements do not have the same detectable properties. In another aspect, bar code tags may include elements that comprise multiple unit tags of varying detectable properties and the elements also vary in length. For example, in a binary code that uses small (S) and large (L) unit tags, the unit tags can be grouped into short elements (of two unit tags) and long elements (of four unit tags) as follows:

SS: small short element
SSSS: small long element
LL: large short element
LLLL: large long element

[0074] A binary code using these elements to form 2×2⁴ combinations, which would vary in length from 8 to 16 unit tags, might appear as follows:

[0075] SS-LL-SSSS-LL

[0076] LL-SSSS-LL-SS

[0077] SS-LL-SS-LL

[0078] LLLL-SSSS-LLLL-SSSS

[0079] If three different types of unit tags and three different lengths of elements are used, there are 3×2^(n-1)×3^(n) combinations, where n is the number of elements in the bar code. The advantage of this “alternating” bar code approach is the simplification of combinatorial synthesis of the probes and bar codes, because fewer steps are needed to add the elements making up the bar code.
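
By way of illustration only, the counts given for alternating bar codes can be checked by enumeration in Python. The function name `alternating_bar_codes`, the hyphen separator and the letters A, B and C standing for three types of unit tags are assumptions of this sketch.

```python
from itertools import product

def alternating_bar_codes(unit_tags: str, element_lengths: tuple[int, ...], n: int) -> list[str]:
    """Enumerate n-element bar codes in which consecutive elements differ in unit tag."""
    codes = []
    for kinds in product(unit_tags, repeat=n):
        if any(a == b for a, b in zip(kinds, kinds[1:])):
            continue  # consecutive elements must not share the same detectable property
        for sizes in product(element_lengths, repeat=n):
            codes.append("-".join(kind * size for kind, size in zip(kinds, sizes)))
    return codes

# Binary code: small (S) and large (L) unit tags, elements of 2 or 4 unit tags.
binary = alternating_bar_codes("SL", (2, 4), 4)
print(len(binary))                       # 32, i.e. 2 x 2**4
print("SS-LL-SSSS-LL" in binary)         # True

# Three unit-tag types and three element lengths, n = 4 elements.
n = 4
ternary = alternating_bar_codes("ABC", (1, 2, 3), n)
print(len(ternary), 3 * 2 ** (n - 1) * 3 ** n)   # 1944 1944
```

The first element may be any of the available unit-tag types, each later element any type other than the preceding one, and every element any of the available lengths, which gives the product stated above.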

[0080] Preferred identification tags are “DNA bar codes” that can be present at one or both ends of the probe. For example, a binary digital bar code can be formed by using a neutral base such as inosine to represent a “1” and the absence of a base (i.e., phosphodiester backbone alone) to represent a “0.” The number of distinct identification tags made possible by this binary system is 2^(N), where N is the number of digits in the bar code. Alternatively, the DNA bar codes need not be limited to a binary system if artificial bases are used to provide multiple signals at each position. For example, a small modified base with a reduced number of groups can be used to signify a “1” and a larger base such as inosine or inosine with additional groups could signify a “2,” thereby providing 3 options (0, 1 or 2) at each position. Four or more options at each base position could be provided by using molecules of different sizes, as long as the molecules are distinguishable by the reader. For example, modified bases that are about twice the size of a conventional nucleotide are known in the art. Modifications that increase or decrease the size of the units at each base position are expected to improve the ability of some readers to discriminate the unit at each base position. The advantages of using DNA bar codes are easier synthesis and detection of the probes and easier processing of the resulting information.
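
By way of illustration only, the binary DNA bar code described above can be modeled in Python. The function names `encode_tag` and `decode_tag` and the single-character notation ("I" for an inosine position, "-" for a position with backbone alone) are assumptions of this sketch rather than notation used in the specification.

```python
def encode_tag(index: int, n_digits: int) -> str:
    """Write probe number `index` as an N-digit bar code: 'I' = 1, '-' = 0."""
    if not 0 <= index < 2 ** n_digits:
        raise ValueError("index does not fit in the requested number of digits")
    bits = format(index, f"0{n_digits}b")
    return "".join("I" if bit == "1" else "-" for bit in bits)

def decode_tag(tag: str) -> int:
    """Recover the probe number from a bar code written with 'I' and '-'."""
    return int("".join("1" if unit == "I" else "0" for unit in tag), 2)

print(encode_tag(5, 4))               # -I-I
print(decode_tag("-I-I"))             # 5
print(len({encode_tag(i, 10) for i in range(2 ** 10)}))  # 1024 distinct 10-digit tags
```

With N digits the scheme yields 2^N distinct tags, so a 10-digit bar code is sufficient to label each of the 1024 possible 5-mers.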

[0081] Similarly, combinations in a “molecular bar code” are made possible by consecutively linking molecules (other than nucleotides) of different size, shape, electrical, magnetic or other properties that can be discriminated from each other. Optionally, one end of the molecular bar code or of the probe molecule has a common tag that shows the orientation of the oligonucleotide probe molecule (i.e., an orientation marker).

[0082] Tagging of probes at both ends with identification tags may allow for identifying 5′/3′ orientation of probes in reading. In addition, when a combination of two smaller sets of shorter probes is used in a Format 3-like method, which is equivalent to the use of a much larger set of longer probes (by ligating the two shorter probes together), further useful combinations of tags are generated (i.e., a new “double” tag is generated).

[0083] Bit-based, or “bar code,” labeling systems could provide specific labeling of thousands of oligonucleotides. The bead or “bar code” could be made of a magnetic, optical, or conductive type of media, much like CD or disk drive media currently used. These micro bar codes would be synthesized as cylinders between 10-50 nm around and no longer than 200 nm. The fabrication process, sputtering, would use technology common to the semiconductor industry. The sputtering process can be utilized to deposit layers of almost any substance on a surface with a very precise, controlled thickness. Alternating layers of two different substances can be generated to create the bar code, then cut into cylinders. These layers of material would be of distinguishable type, i.e., clear vs. opaque for optical detection or conductive vs. insulating for detection by conductivity.

[0084] Under some conditions, it is also possible that fluorescent dyes may be detected at a single molecule level. Multiple fluorochromes currently exist and are commonly used in flow cytometry. These fluorescent markers can be covalently attached to bead resins to generate distinct bead labels.

[0085] Microspheres with multiple fluorescent molecular fillings, different materials, surface texture, surface patterns, etc. can be utilized as identification tags. The microspheres would be covalently bound to oligonucleotide probes during oligonucleotide synthesis. Fluorescently filled microspheres are currently available from Molecular Probes, Inc. and other companies. Microspheres as small as 20 nm diameter polystyrene beads are currently available.

[0086] Spectral absorption labels may also be used. A possible methodology for detection would be to mix into the bead polymer different materials that absorb and pass different spectra of light. Each different type of bead could be detected by passing a multi-spectral light through the bead and detecting which spectra are absorbed.

[0087] A complete set of all possible probes of a given length (4^(N), where N is the length) or a subset of this complete set may be used in the hybridization step. Probes of differing lengths may also be used. A large number of probes may be synthesized in a small number of reactions. For example, a complete set of all possible 10-mers (about 1 million probes) may be synthesized as follows. 1000 5-mers, each uniquely associated with a 10-digit DNA bar code, are synthesized in 1000 reactions and mixed. The mixture is divided into 1000 aliquots which then undergo 1000 reactions, during which the informational length of the probe is extended by an additional 5 nucleotides and the 10-digit bar code is extended by a further 10 digits, to form 1 million uniquely tagged 10-mers synthesized in only 2000 reactions.
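
By way of illustration only, the two-round synthesis described above can be simulated in Python on a reduced scale. The function names `barcoded_blocks` and `split_and_pool`, the use of a plain binary string as the bar code, and the assignment of codes in enumeration order are assumptions of this sketch. The exact counts are 4⁵ = 1024 5-mers and 4¹⁰ = 1,048,576 10-mers, which the text rounds to 1000 and 1 million.

```python
from itertools import product

BASES = "ACGT"

def barcoded_blocks(block_len: int) -> dict[str, str]:
    """Pair every block of `block_len` bases with a fixed-width binary bar code."""
    width = 2 * block_len  # 2 binary digits per base suffice, since 4**n == 2**(2n)
    blocks = ("".join(p) for p in product(BASES, repeat=block_len))
    return {block: format(i, f"0{width}b") for i, block in enumerate(blocks)}

def split_and_pool(block_len: int) -> tuple[dict[str, str], int]:
    """Simulate two rounds of synthesis; return tagged probes and the reaction count."""
    round_one = barcoded_blocks(block_len)   # one reaction per block, then mixed
    reactions = len(round_one)
    probes = {}
    for block, code in round_one.items():    # one aliquot (reaction) per added block
        reactions += 1
        for probe, tag in round_one.items():
            probes[probe + block] = tag + code  # extend both the probe and its bar code
    return probes, reactions

probes, reactions = split_and_pool(2)    # reduced scale: all 4-mers from 2-mer blocks
print(len(probes), reactions)            # 256 32
print(4 ** 5, 4 ** 10, 2 * 4 ** 5)       # 1024 1048576 2048
```

Because each round appends a fixed-width code, the concatenated bar code identifies both blocks, so every extended probe keeps a unique tag while the number of reactions grows only as twice the number of blocks.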

[0088] To make a mixture of oligonucleotides in a small number of reactions, and to have each different sequence tagged with a unique tag, a combinatorial addition of bases and tag units has to be performed. For example, all 6-mers (4096) can be made in 6×4 reactions. In the first set of four reactions, four different unit tags are coupled, one for each of the four nucleotides. These unit tags may be four different monomers or, for example, four dimers of two different monomers, premade or synthesized in two sets in these four reactions. The products from the four reactions are mixed and split in approximately equimolar amounts into a second set of four reactions. In each of these reactions, one of the nucleotides is coupled to form a dimer, and a unique unit of tag is linked to the previous unit of tag. The result in one of the four reactions may be represented as T1-T1-A-A, T1-T2-C-A, T1-T3-G-A and T1-T4-T-A, where each different tag T is unique or may represent a unique oligomer, e.g., T1=t1, T2=t1-t2, T3=t1-t3, T4=t1-t4.

[0089] Conventional labeling of probes that requires detection of multiple molecules is not required according to the present invention, but if desired, oligonucleotide probes may be labeled with fluorescent dyes, chemiluminescent systems, radioactive labels (e.g., ³⁵S, ³H, ³²P or ³³P) or with isotopes detectable by mass spectrometry, by any of a variety of methods that are well known in the art.

[0090] Hybridization Reaction

[0091] The number and type of probes that are used in each solution-based hybridization reaction depend on the detection/tagging resolving power of the reader, the statistics of informative pools (or other pools) for probes having different sequences that cannot be discriminated by the reader, and the SBH application (e.g., whether de novo sequencing, resequencing or genotyping is desired). A complete set of all possible probe sequences of the same length may be used, or only a portion of this complete set may be used. Alternatively, probes of differing length may be used. The only requirement is that there be overlap among the sequences of the information region of some, not necessarily all, of the probe molecules.

[0092] According to the present invention, purely statistical factors suggest that the reader should detect about 5-10 of the probe molecules that hybridize to the target nucleic acid. Taking into account the fact that only a certain fraction of available probe molecules that could hybridize will actually hybridize to the target nucleic acid, the fact that the reader is not expected to read all of the probe molecules that hybridize to the target nucleic acid, and other additional factors such as the use of degenerate ends, ideally each information region sequence is represented by about 10⁴ or less, or 10⁵ or less probe molecules, although about 5000 or less, or about 10³ or less, or about 500 or less, or about 10² or less may suffice.

[0093] Before positive probes are selected or detected, methods known in the art can be used to eliminate mismatched target:probe hybrids, e.g., enzymes specific to mismatched regions of hybridization or chemicals that degrade mismatched hybrids.

[0094] Hybridization Conditions

[0095] Hybridization and washing conditions may be selected to detect substantially perfect match hybrids (such as those wherein the fragment and probe hybridize at six out of seven positions), or may be selected to permit detection only of perfect match hybrids by the use of more stringent hybridization conditions.

[0096] Exemplary conditions according to the present invention include conditions which produce, on average, more probe:target hybridization with probe molecules which are perfectly complementary to the target nucleic acid in the information region of the probe molecules than with probe molecules which are mismatched in the information region. Such conditions may allow discrimination of probe molecules which are perfectly complementary to target from those which are not.

[0097] Suitable hybridization conditions may be routinely determined by optimization procedures or pilot studies. Such procedures and studies are routinely conducted by those skilled in the art to establish protocols for use in a laboratory. See, e.g., Ausubel et al., Current Protocols in Molecular Biology, Vol. 1-2, John Wiley & Sons (1989); Sambrook et al., Molecular Cloning: A Laboratory Manual, 2nd Ed., Vols. 1-3, Cold Spring Harbor Press (1989); and Maniatis et al., Molecular Cloning: A Laboratory Manual, Cold Spring Harbor Laboratory, Cold Spring Harbor, N.Y. (1982), all of which are incorporated by reference herein. For example, conditions such as temperature, concentration of components, hybridization and washing times, buffer components, and their pH and ionic strength may be varied. See also Int'l Publication No. WO 98/31836 published Jul. 23, 1998 and WO 99/09217 published Feb. 28, 1999, both of which are incorporated herein by reference.

[0098] Selection of Positive Probes

[0099] The subset of positive probes, i.e., those probes which have hybridized to the target polynucleotide during the hybridization reaction step, may be selected or optionally separated out from the initial pool of probes in a variety of different ways. For example, probes in double-stranded probe:target hybrids can be selected from single-stranded negative probes by enzymatic digestion of single-stranded or mismatched probes using enzymes that recognize single-stranded nucleic acid or mismatched hybridized nucleic acid (if all of the single-stranded nucleic acid is digested, then the remaining nucleic acids are the positive probes). Any suitable exonuclease or endonuclease known in the art may be used; exemplary enzymes include mung bean nuclease, exonuclease VII, Bal31 nuclease and nuclease S1 (which not only degrades single-stranded ends but may also cleave at small mismatched gaps). See generally, Maniatis, Molecular Cloning: A Laboratory Manual, Cold Spring Harbor Laboratory, Cold Spring Harbor (1982).

[0100] Alternatively, double-stranded probe:target hybrids can be separated from single-stranded nucleic acid by any procedure known in the art, such as hydroxylapatite chromatography, gel electrophoresis, or other gradients.

[0101] In yet a further embodiment, the target nucleic acid or the probes can be associated with a separator tag, e.g., biotin or fluorescein, and positive probes that have hybridized to the target nucleic acid can be selected by using the separator tag to separate the target nucleic acid from the solution, e.g., with avidin or an anti-fluorescein antibody. Any suitable binding partner known in the art can be used as a separator tag, e.g., biotin-streptavidin, antigen-antibody binding partner pairs (such as FLAG tag-antibody), histidine tag-nickel, calmodulin binding peptide-calmodulin, dihydrofolate reductase-methotrexate, maltose binding protein-amylose, chitin binding domain-chitin, cellulose binding domain-cellulose, and glutathione-S-transferase (GST)-glutathione, and other binding partner pairs known in the art.

[0102] As another example, when a Format 3-like type SBH method is used, in which two sets of shorter probes are combinatorially connected (ligated) to simulate a much larger set of longer probes, two shorter probes are ligated when they hybridize contiguously on the target polynucleotide. If each set of shorter probes is commonly tagged and a different common separator tag is used for each set, e.g., tag A for one set and tag B for the second set, the set of ligated positive probes will have both tag A and tag B but the negative unligated probes will only have one tag (either tag A or tag B). These separator tags thus allow separation of the positive probes in a consecutive two-step process.

[0103] In one specific embodiment, a set of 8-mers tagged with biotin and a set of 9-mers tagged with fluorescein are used in the hybridization reaction; probes that hybridize contiguously to the target form positive 17-mer probes tagged with both biotin and fluorescein. The 17-mers can then be separated from the solution by contacting the solution first with support-bound avidin followed by a wash to remove unbound probes and target, and then contacting the solution with support-bound anti-fluorescein antibody, followed by a wash to remove unbound probes (or vice versa).

[0104] Readers Used to Detect Positive Probes

[0105] Different types of readers may be used to detect probes that hybridize to the target nucleic acid under conditions that allow discrimination of perfectly complementary probes from mismatched probes. The requirements for a reader useful according to the present invention are: 1) ability to detect individual oligonucleotide probes at a molecular level and discriminate them from other oligonucleotide probes (either by determining the probe sequence itself or by detecting one, two or more identification tags associated with the probe); 2) capacity to discriminate a large number (hundreds or thousands) of different oligonucleotides; and 3) sufficient detection speed (using one or many integrated detection channels) to be able to perform statistical sampling sufficient for detecting all positive oligonucleotide probes resulting from the hybridization reaction step. In contrast, conventional detection methods using, e.g., radioisotope labeling require the presence of hundreds or thousands of positive probes before a detectable signal is generated.

[0106] If the probe molecule contains an identification tag, preferably the identification tag is detected by the reader without cleaving the tag from the probe. In addition, preferably the detection step takes place in solution. Most preferably, many integrated detection channels/readers are used to improve detection speed by parallel processing.

[0107] Suitable readers include readers based on nanopore channels, in which the associated identification tag or the probe sequence itself is read as each probe molecule passes through or over the channel. Other suitable readers include readers based on molecular imaging, in which microscopy at the molecular level allows the discrimination of identification tags. Any other sensor-based method known in the art may be useful according to the present invention if it fulfills the general criteria set forth above.

[0108] One preferred reader is based on detection of electrical or other responses at a small pore when an oligonucleotide molecule or its associated tag passes through or over the pore. See, e.g., Church et al., U.S. Pat. No. 5,795,782, incorporated herein by reference, and related publications. As described therein, the molecules can be induced to cross over or through the pore, e.g., by a polymerase or other template-dependent polymer replicating catalyst linked to the pore or located close to the pore, or by a voltage gradient or electric field (i.e., electrophoresis). See also the related publication Meller et al., Proc. Nat'l Acad. Sci., 97:1079-1084 (2000), incorporated herein by reference.

[0109] The size of these nanopore channels depends on the types of molecules to be detected and may vary without affecting the mode of detection; the channel should be large enough to accommodate the probe molecule and any associated identification tag, yet should be small enough so that the passage of the probe molecule has a detectable effect on the electrical or other response of the channel. Exemplary sizes include pores ranging from about 1 nm to about 100 nm, or preferably about 3 to about 30 nm in diameter. In comparison, the diameter of a nucleotide is typically about 0.3 nm.

[0110] Instead of pores, instruments based on various types of gates or channels may be used to read the probe molecules or their associated identification tags as they pass through or pass by. Two layers of (identical or different) pores may be used to read 5′ and 3′ tags of the probe.

[0111] When the oligonucleotide probes are not modified or tagged, discrimination of the natural nucleotide bases in the probes may be used to identify the probe. It may be more difficult to discriminate among all four bases [adenine (A), guanine (G), thymine (T) and cytosine (C)] individually than it is to differentiate A and G (two relatively large purine nucleotides) from T and C (two relatively small pyrimidine nucleotides). For detection of probes according to the present invention, it is not necessary to identify all four bases. For example, differentiation between purines and pyrimidines may be sufficient when probes in the form of PyPuPyPuPyPu, etc. are used. The 5′ or 3′ end oligonucleotide sequence may also be used in this manner to form a digital bar code.
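
By way of illustration only, the purine/pyrimidine discrimination described above may be sketched as a reduction of a probe sequence to a two-symbol code. The function name and the assignment of 1 to purines and 0 to pyrimidines are assumptions made for this example and are not part of the described method.

```python
PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}

def purine_pyrimidine_code(sequence):
    """Return a string of 1s (purine) and 0s (pyrimidine) for a DNA sequence."""
    bits = []
    for base in sequence.upper():
        if base in PURINES:
            bits.append("1")
        elif base in PYRIMIDINES:
            bits.append("0")
        else:
            raise ValueError(f"unexpected base: {base!r}")
    return "".join(bits)

# A probe of the form PyPuPyPuPyPu reads as an alternating code.
print(purine_pyrimidine_code("CATGCA"))  # 010101
```

A reader that resolves only the two size classes would, under these assumptions, still distinguish any two probes whose codes differ.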

[0112] As another example, when multicolor optical immunolabels based on metallic plasmon-resonant nanoparticles are used as identification tags for the probes, an optical imaging reader may be used to detect the illumination as described in Schultz et al., Proc. Nat'l Acad. Sci., 97:996-1001 (2000), incorporated herein by reference.

[0113] Alternatively, chemical sensors based on individual single-walled carbon nanotubes (SWNTs) have been used to detect small concentrations of toxic gas molecules and may be used to detect the presence of molecules used as identification tags. See, e.g., Kong et al., Science, 287:622-625 (2000), incorporated herein by reference, which reports that the electrical resistance of individual semiconducting SWNTs (of a diameter of approximately 1.8 nm) changes by up to three orders of magnitude within several seconds of exposure to NO₂ or NH₃ molecules at room temperature.

[0114] When Format 3-like SBH is used, the positive probes (a ligated combination of two shorter probes) may be discriminated from the negative (unligated) shorter probes on the basis of length or by the presence of two identification tags. In this embodiment, the positive probes (i.e., probes that hybridize under appropriate conditions to the target nucleic acid) need not be separated from the negative probes.

[0115] The subset of probes scored as positives may be further statistically analyzed to remove false positives from the subset or to add false negatives to the subset. For example, the likelihood of probe X being a true positive may be determined by comparing the frequency of positive scores of probes which have one base difference from probe X's sequence. Probes with one base difference from probe X should score positive with a certain frequency if probe X is a true positive.
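
The comparison described above may be sketched, under stated assumptions, as a count of how many of the probes differing from probe X by one base were themselves scored positive. The function names are hypothetical, and the frequency that would indicate a true positive is not specified here; it would have to be calibrated for the hybridization conditions used.

```python
BASES = "ACGT"

def one_base_neighbors(probe):
    """Yield every sequence that differs from `probe` at exactly one position."""
    for i, original in enumerate(probe):
        for base in BASES:
            if base != original:
                yield probe[:i] + base + probe[i + 1:]

def neighbor_positive_fraction(probe, positives):
    """Fraction of one-base-different probes that were scored positive."""
    neighbors = list(one_base_neighbors(probe))
    hits = sum(1 for neighbor in neighbors if neighbor in positives)
    return hits / len(neighbors)

# Illustrative data: two of the twelve neighbors of ACGT are in the positive set.
positives = {"ACGT", "ACGA", "TCGT", "GGGG"}
print(neighbor_positive_fraction("ACGT", positives))
```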

[0116] Counting the number of probes having the same information region produces a measure of the extent of hybridization and is especially useful for statistical analysis of overlapped probe sequences.

[0117] Exemplary Applications

[0118] Although theoretically all possible probes can be used in a single hybridization reaction, the number of probes used in a single hybridization reaction may be practically limited by the number of available unique identification tags. If the informative pool methodology is used, each probe need not be uniquely labeled, as a common tag for the pool can suffice. The number of probes used, the number of identification tags used and the number of hybridization reactions for optimal use of the present methods may vary. Exemplary SBH applications that require a very large number of probes include the following:

[0119] (1) For human and other complex genome resequencing: 3×10¹⁰ informative pools of 32 20-mers may be scored in 3000 hybridization reactions, each of which has a multiplex of 1×10⁷ informative pools, requiring the capacity to discriminate 10 million oligonucleotide probes;

[0120] (2) For genotyping one million single nucleotide polymorphisms (SNPs) in an individual human genome: 10 million probes may be needed (1 million sites × 2 alleles × 5 overlapped probes) and may be used in 1-10 independent hybridization/ligation reactions, depending on the tagging strategy and the selection and grouping of SNPs with different sequence to allow simultaneous detection and distinguishing of SNPs;

[0121] (3) For de novo sequencing of 100,000 bases with 17-mers: one million informative pools of 16,384 17-mers may be scored in one single hybridization reaction, or one million informative pools may be scored in 16 reactions (with 65,536 unique tags repeatedly used in each reaction).
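
The figures in the three exemplary applications above can be checked with a few lines of arithmetic. The following sketch only restates numbers given in the text; the variable names are chosen for illustration.

```python
# Application (1): all 20-mers grouped into informative pools of 32 probes.
pools_20mers = 4**20 // 32                  # 34,359,738,368, i.e. about 3x10^10
pools_scored_1 = 3000 * 10**7               # 3000 reactions x 10^7 pools each

# Application (2): probes needed to genotype one million SNPs.
snp_probes = 1_000_000 * 2 * 5              # sites x alleles x overlapped probes

# Application (3): all 17-mers grouped into informative pools of 16,384 probes.
pools_17mers = 4**17 // 16_384              # 1,048,576, i.e. about one million
reactions_3 = pools_17mers // 65_536        # reactions needed with 65,536 tags

print(pools_20mers, pools_scored_1)
print(snp_probes)
print(pools_17mers, reactions_3)
```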

[0122] All three applications may be carried out using hybridization only, or with probe extension by DNA polymerase by at least one base, or Format 3-like SBH in which two sets of shorter probes are combinatorially connected (ligated) to be the equivalent of a much larger set of longer probes. The latter approach offers additional flexibility in combinatorial probe synthesis (i.e., simultaneous synthesis of many probes with different sequences and unique identification tags in the same reaction vessel).

[0123] Probe extension by DNA polymerase by at least one base may be carried out essentially as described in, e.g., Cantor, U.S. Pat. No. 5,503,980, incorporated herein by reference. Briefly, the probe:target hybrid can be extended by a nucleotide, e.g., by adding a labeled nucleotide, such as a ddNTP, and using a polymerase (e.g., a Klenow fragment) to extend the probe molecule. All four nucleotides may be added at once if they are differently labeled or tagged.

[0124] An example of Format 3-like probe synthesis and scoring for the above example of de novo sequencing of a 100 kb polynucleotide using a complete set of 17-mers in 16 hybridization reactions is described below.

[0125] Each of the required 17-mer probes may be formed by combining a 9-mer probe and an 8-mer probe. The 9-mers may be prepared as follows. For example, all 256 possible 4-mers may be synthesized in 256 separate reactions and tagged with 256 unique identification tags, positioned at the 3′ end. All of the 4-mers are mixed together and then divided into 64 aliquots, which each undergo reactions to add one of all 64 possible 3-mers, thereby forming all possible 7-mers. Finally, all of the 7-mers are mixed together and divided into 16 aliquots, which each undergo reactions to add one of all 16 possible 2-mers, thereby forming all possible 9-mers (approximately 262,000 possible 9-mers). The 9-mers remain divided into 16 aliquots or pools, each of which is used in one of 16 independent hybridization reactions.
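
The counts produced by the mix-and-divide synthesis described above may be reproduced with the following minimal sketch. It models only the combinatorics, not the chemistry, and the helper name `all_kmers` is an assumption made for this example.

```python
from itertools import product

BASES = "ACGT"

def all_kmers(k):
    """Return every DNA sequence of length k."""
    return ["".join(p) for p in product(BASES, repeat=k)]

four_mers = all_kmers(4)    # 256 tagged starting oligonucleotides
three_mers = all_kmers(3)   # 64 aliquots in the first division
two_mers = all_kmers(2)     # 16 aliquots in the second division

# Every 4-mer is extended by every 3-mer, giving all possible 7-mers.
seven_mer_count = len(four_mers) * len(three_mers)

# Each final aliquot holds every 7-mer extended by one specific 2-mer.
nine_mers_per_pool = seven_mer_count
nine_mer_count = nine_mers_per_pool * len(two_mers)

print(seven_mer_count, nine_mers_per_pool, nine_mer_count)  # 16384 16384 262144
```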

[0126] For 8-mers, a mix of all 256 possible 4-mers is made and then divided to undergo 256 additional reactions, each of which adds one of all 256 possible 4-mers and adds at the 5′ end one specific identification tag and one common tag for orientation of the oligonucleotide probe during the detection step. All 256 reactions are then mixed to form a pool of all 8-mers.

[0127] Instead of using 256 tags for marking 256 4-mers at each end, a smaller number of tags may be used to label one of the two 4-mers that might not be sufficiently distinct in the reading process.

[0128] One of the 16 separate pools of 9-mers and a pool of all of the 8-mers are hybridized with the target nucleic acid under appropriate conditions. When the hybridization reaction is completed, the ligation products are selected (for example by consecutive binding of separator tags, as described above), and detected by the reader. The orientation of each ligation product can be defined by the 5′ tag from the 8-mers (e.g., an anion). In this combination of informative pools and tags, a specific signal for a 17-mer is the combination of two signals at both 5′ and 3′ ends. At each end either a specific 4-mer or a 4-mer bar code tag is detected. Every one of 65,536 different combinations of signals represents a unique 4-mer+4-mer and 16,384 different middle 9-mer sequences in each of the 16 pools hybridized in independent reactions, thus forming informative pools of 16,384 17-mers.

[0129] For a 100 kb target nucleic acid, only about 200,000 17-mers are positive (1 in 80,000) but each of these 200,000 positive probes should be represented about 10 times to allow a statistical sampling to be representative of all positive probes (resulting in a total of 2 million positive probe molecules). To sample almost all 17-mers, optimally about one million probe molecules are scored, a majority of which will be redundant and will therefore have been confirmed by multiple detection.
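
The relationship between the number of probe molecules scored and the fraction of distinct positive probes detected may be estimated with a simple sampling model. The sketch below assumes that all positive probes are equally abundant and that molecules are read independently (a Poisson approximation); neither assumption is stated in the text, and the function name is hypothetical.

```python
import math

def expected_fraction_detected(molecules_scored, distinct_positives):
    """Expected fraction of distinct positive probes read at least once."""
    mean_coverage = molecules_scored / distinct_positives
    return 1.0 - math.exp(-mean_coverage)

positives = 200_000

# Ratio of all possible 17-mers to positive 17-mers (on the order of 1 in 80,000).
print(4**17 / positives)

# Fraction of distinct positives seen when one or two million molecules are scored.
for scored in (1_000_000, 2_000_000):
    print(scored, expected_fraction_detected(scored, positives))
```

Under these assumptions, fivefold mean coverage already detects more than 99% of the distinct positive probes, and tenfold coverage leaves fewer than 1 in 10,000 undetected.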

[0130] Conventional Format 1 and 2 SBH

[0131] A. Assay Format

[0132] Format 1 SBH is appropriate for the simultaneous analysis of a large set of samples. Parallel scoring of thousands of samples on large arrays may be performed in thousands of independent hybridization reactions using small pieces of membranes. The identification of DNA may involve 1-20 probes per reaction and the identification of mutations may in some cases involve more than 1000 probes specifically selected or designed for each sample. For identification of the nature of the mutated DNA segments, specific probes may be synthesized or selected for each mutation detected in the first round of hybridizations.

[0133] DNA samples may be prepared in small arrays which may be separated by appropriate spacers, and which may be simultaneously tested with probes selected from a set of oligonucleotides which may be arrayed in multiwell plates. Small arrays may consist of one or more samples. DNA samples in each small array may include mutants or individual samples of a sequence. Consecutive small arrays may be organized into larger arrays. Such larger arrays may include replication of the same small array or may include arrays of samples of different DNA fragments. A universal set of probes includes sufficient probes to analyze a DNA fragment with prespecified precision, e.g. with respect to the redundancy of reading each base pair (“bp”). These sets may include more probes than are necessary for one specific fragment, but may include fewer probes than are necessary for testing thousands of DNA samples of different sequence.

[0134] DNA or allele identification and a diagnostic sequencing process may include the steps of:

[0135] 1) Selection of a subset of probes from a dedicated, representative or universal set to be hybridized with each of a plurality of small arrays;

[0136] 2) Adding a first probe to each subarray on each of the arrays to be analyzed in parallel;

[0137] 3) Performing hybridization and scoring of the hybridization results;

[0138] 4) Stripping off previously used probes;

[0139] 5) Repeating hybridization, scoring and stripping steps for the remaining probes which are to be scored;

[0140] 6) Processing the obtained results to obtain a final analysis or to determine additional probes to be hybridized;

[0141] 7) Performing additional hybridizations for certain subarrays; and

[0142] 8) Processing complete sets of data and obtaining a final analysis.

[0143] This approach provides fast identification and sequencing of a small number of nucleic acid samples of one type (e.g. DNA, RNA), and also provides parallel analysis of many sample types in the form of subarrays by using a presynthesized set of probes of manageable size. Two approaches have been combined to produce an efficient and versatile process for the determination of DNA identity, for DNA diagnostics, and for identification of mutations.

[0144] For the identification of known sequences, a small set of shorter probes may be used in place of a longer unique probe. In this approach, although there may be more probes to be scored, a universal set of probes may be synthesized to cover any type of sequence. For example, a full set of 6-mers includes only 4,096 probes, and a complete set of 7-mers includes only 16,384 probes.

[0145] Full sequencing of a DNA fragment may be performed with two levels of hybridization. One level is hybridization of a sufficient set of probes that cover every base at least once. For this purpose, a specific set of probes may be synthesized for a standard sample. The results of hybridization with such a set of probes reveal whether and where mutations (differences) occur in non-standard samples. To determine the identity of the changes, additional specific probes may be hybridized to the sample.

[0146] In another variation, all probes from a universal set may be scored. A universal set of probes allows scoring of a relatively small number of probes per sample in a two step process without an undesirable expenditure of time. The hybridization process may involve successive probings, in a first step of computing an optimal subset of probes to be hybridized first and, then, on the basis of the obtained results, a second step of determining additional probes to be scored from among those in a universal set.

[0147] B. Sequence Assembly

[0148] In SBH sequence assembly, K-1 oligonucleotides which occur repeatedly in analyzed DNA fragments due to chance or biological reasons may be subject to special consideration. If there is no additional information, relatively small fragments of DNA may be fully assembled inasmuch as every base pair is read several times.

[0149] In the assembly of relatively longer fragments, ambiguities may arise due to the repeated occurrence in a set of positively-scored probes of a K-1 sequence (i.e., a sequence shorter than the length of the probe). This problem does not exist if mutated or similar sequences have to be determined (i.e., the K-1 sequence is not identically repeated). Knowledge of one sequence may be used as a template to correctly assemble a sequence known to be similar (e.g. by its presence in a database) by arraying the positive probes for the unknown sequence to display the best fit on the template.

[0150] Within DNA, the location of certain probes may be interchangeable when determined by overlapping the sequence data, resulting in an ambiguity as to the position of the partial sequence. Although the sequence information is determined by SBH, either: (i) long read length, single-pass gel sequencing at a fraction of the cost of complete gel sequencing; or (ii) comparison to related sequences, may be used to order hybridization data where such ambiguities (“branch points”) occur. In addition, segments in junk DNA (which is not found in genes) may be repeated many times in tandem. Although the sequence of the segments is determined by SBH, single-pass gel sequencing may be used to determine the number of tandem repeats where tandemly-repeated segments occur. As tandem repeats occur rarely in protein-encoding portions of a gene, the gel-sequencing step will be performed only when a commercial value for the sequence is determined.

[0151] C. Sequencing of Mutants

[0152] The use of an array of sample arrays avoids consecutive scoring of many oligonucleotides on a single sample or on a small set of samples. This approach allows the scoring of more probes in parallel by manipulation of only one physical object. Subarrays of DNA samples 1000 bp in length may be sequenced in a relatively short period of time. If the samples are spotted at 50 subarrays in an array and the array is reprobed 10 times, 500 probes may be scored. In screening for the occurrence of a mutation, approximately 335 probes may be used to cover each base three times. If a mutation is present, several covering probes will be affected. The use of information about the identity of negative probes may map the mutation with a two base precision. To solve a single base mutation mapped with this precision, an additional 15 probes may be employed. These probes cover any base combination for two questionable positions (assuming that deletions and insertions are not involved). These probes may be scored in one cycle on 50 subarrays which contain a given sample. In the implementation of a multiple label color scheme (i.e., multiplexing), two to six probes, each having a different label such as a different fluorescent dye, may be used as a pool, thereby reducing the number of hybridization cycles and shortening the sequencing process.
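
The figure of 15 additional probes follows from enumerating every base combination at two positions (16 combinations) and excluding the reference combination. A minimal sketch is given below; the function name, the example sequence and the choice to write probes in the same orientation as the reference are assumptions made for illustration.

```python
from itertools import product

BASES = "ACGT"

def alternative_probes(reference, pos_a, pos_b):
    """All variants of `reference` at two positions, excluding the reference itself."""
    variants = []
    for base_a, base_b in product(BASES, repeat=2):
        candidate = list(reference)
        candidate[pos_a] = base_a
        candidate[pos_b] = base_b
        candidate = "".join(candidate)
        if candidate != reference:
            variants.append(candidate)
    return variants

probes = alternative_probes("ACGTACGT", 3, 4)
print(len(probes))   # 15
print(50 * 10)       # 500 probes scored with 50 subarrays reprobed 10 times
```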

[0153] In more complicated cases, there may be two close mutations or insertions. They may be handled with more probes. For example, a three base insertion may be solved with 64 probes. The most complicated cases may be approached by several steps of hybridization, and the selecting of a new set of probes on the basis of results of previous hybridizations.

[0154] If subarrays to be analyzed include tens or hundreds of samples of one type, then several of them may be found to contain one or more changes (mutations, insertions, or deletions). For each segment where mutation occurs, a specific set of probes may be scored. The total number of probes to be scored for a type of sample may be several hundreds. The scoring of replica arrays in parallel facilitates scoring of hundreds of probes in a relatively small number of cycles. In addition, compatible probes may be pooled. Positive hybridizations may be assigned to the probes selected to check particular DNA segments because these segments usually differ in 75% of their constituent bases.

[0155] By using a larger set of longer probes, longer targets may be conveniently analyzed. These targets may represent pools of shorter fragments such as pools of exon clones.

[0156] D. Identification of Heterozygotes Using SBH

[0157] A specific hybridization scoring method may be employed to define the presence of heterozygotes (sequence variants) in a genomic segment to be sequenced from a diploid chromosomal set. Two variations are where: i) the sequence from one chromosome represents a basic type and the sequence from the other represents a new variant; or, ii) both chromosomes contain new, but different variants. In the first case, the scanning step designed to map changes gives a maximal signal difference of two-fold at the heterozygotic position. In the second case, there is no masking, but a more complicated selection of the probes for the subsequent rounds of hybridizations may be indicated.

[0158] Scoring two-fold signal differences required in the first case may be achieved efficiently by comparing corresponding signals with controls containing only the basic sequence type and with the signals from other analyzed samples. This approach allows determination of a relative reduction in the hybridization signal for each particular probe in a given sample. This is significant because hybridization efficiency may vary more than two-fold for a particular probe hybridized with different DNA fragments having the same full match target. In addition, heterozygotic sites may affect more than one probe depending upon the number of oligonucleotide probes. Decrease of the signal for two to four consecutive probes produces a more significant indication of heterozygotic sites. Results may be checked by testing with small sets of selected probes, among which one or a few probes are selected to give a full match signal which is on average eight-fold stronger than the signals coming from mismatch-containing duplexes.
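
The comparison against a control described above may be sketched as a scan for runs of consecutive probes whose signal, relative to the control, is reduced. The function name, the ratio cutoff of 0.6 and the minimum run length of two probes are assumptions chosen for this example; suitable values would depend on the measured signal variability.

```python
def flag_reduced_runs(sample, control, max_ratio=0.6, min_run=2):
    """Return (first, last) index pairs of consecutive probes with reduced signal.

    A probe counts as reduced when sample/control <= max_ratio; only runs of
    at least `min_run` consecutive reduced probes are reported.
    """
    n = min(len(sample), len(control))
    runs = []
    start = None
    for i in range(n):
        reduced = control[i] > 0 and sample[i] / control[i] <= max_ratio
        if reduced and start is None:
            start = i
        elif not reduced and start is not None:
            if i - start >= min_run:
                runs.append((start, i - 1))
            start = None
    if start is not None and n - start >= min_run:
        runs.append((start, n - 1))
    return runs

control = [100, 100, 100, 100, 100, 100, 100, 100]
sample = [98, 102, 52, 49, 51, 97, 45, 101]
print(flag_reduced_runs(sample, control))  # [(2, 4)]
```

In this illustrative data the isolated drop at the seventh probe is ignored, while the three consecutive probes with roughly half the control signal are reported as a candidate heterozygotic site.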

[0159] Partitioned membranes allow a very flexible organization of experiments to accommodate relatively larger numbers of samples representing a given sequence type, or many different types of samples represented with relatively small numbers of samples. A range of 4-256 samples can be handled with particular efficiency. Subarrays within this range of numbers of dots may be designed to match the configuration and size of standard multiwell plates used for storing and labeling oligonucleotides. The size of the subarrays may be adjusted for different numbers of samples, or a few standard subarray sizes may be used. If all samples of a type do not fit in one subarray, additional subarrays or membranes may be used and processed with the same probes. In addition, by adjusting the number of replicas for each subarray, the time for completion of the identification or sequencing process may be varied.

[0160] Signature Analysis with SBH

[0161] Obtaining information about the degree of hybridization exhibited for a set of only about 200 oligonucleotide probes (about 5% of the effort required for complete sequencing) defines a unique signature of each gene and may be used for sorting the cDNAs from a library to determine if the library contains multiple copies of the same gene. By such signatures, identical, similar and different cDNAs can be distinguished and inventoried.
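
One way such signatures might be compared, offered only as a sketch, is to treat each cDNA's hybridization scores as a vector and group clones whose vectors correlate strongly. The use of Pearson correlation, the 0.95 threshold, the greedy grouping and all names below are assumptions made for this example; the text does not prescribe a particular comparison method.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant score vectors."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)

def group_by_signature(signatures, threshold=0.95):
    """Greedily place each clone in the first group whose representative matches."""
    groups = []  # list of (representative name, member names)
    for name, scores in signatures.items():
        for representative, members in groups:
            if pearson(signatures[representative], scores) >= threshold:
                members.append(name)
                break
        else:
            groups.append((name, [name]))
    return [members for _, members in groups]

signatures = {
    "clone1": [9.0, 1.0, 8.0, 2.0, 7.0],
    "clone2": [9.2, 1.1, 7.9, 2.2, 6.8],
    "clone3": [1.0, 9.0, 2.0, 8.0, 3.0],
}
print(group_by_signature(signatures))  # [['clone1', 'clone2'], ['clone3']]
```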

[0162] Format 3 Sequencing by Hybridization

[0163] Format 3 SBH (as well as Formats 1 and 2) is described more fully in, e.g., Int'l Publication Nos. WO 98/31836 published Jul. 23, 1998 and WO 99/09217 published Feb. 28, 1999, incorporated herein by reference. In Format 3, a first set of oligonucleotide probes of known sequence is immobilized on a solid support under conditions which permit them to hybridize with nucleic acids having respectively complementary sequences. A labeled, second set of oligonucleotide probes is provided in solution. Both within the sets and between the sets the probes may be of the same length or of different lengths. A nucleic acid to be sequenced or intermediate fragments thereof may be applied to the first set of probes in double-stranded form (especially where a recA protein is present to permit hybridization under non-denaturing conditions), or in single-stranded form and under conditions which permit hybrids of different degrees of complementarity (for example, under conditions which discriminate between full match and one base pair mismatch hybrids). The nucleic acid to be sequenced or intermediate fragments thereof may be applied to the first set of probes before, after or simultaneously with the second set of probes. A ligase or other means of causing chemical bond formation between adjacent, but not between nonadjacent, probes may be applied before, after or simultaneously with the second set of probes. After permitting adjacent probes to be chemically bonded, fragments and probes which are not immobilized to the surface by chemical bonding to a member of the first set of probes may be washed away, for example, using a high temperature (up to 100 degrees C.) wash solution which melts hybrids. The bound probes from the second set may then be detected using means appropriate to the label employed (which may be chemiluminescent, fluorescent, radioactive, enzymatic or densitometric, for example).
Herein, nucleotide bases “match” or are “complementary” if they form a stable duplex by hydrogen bonding under specified conditions. For example, under conditions commonly employed in hybridization assays, adenine (“A”) matches thymine (“T”), but not guanine (“G”) or cytosine (“C”). Similarly, G matches C, but not A or T. Other bases which will hydrogen bond in less specific fashion, such as inosine or the Universal Base (“M” base, Nichols et al., 1994), or other modified bases, such as methylated bases, for example, are complementary to those bases for which they form a stable duplex under specified conditions. A probe is said to be “perfectly complementary” or is said to be a “perfect match” if each base in the probe forms a duplex by hydrogen bonding to a base in the nucleic acid to be sequenced. Each base in a probe that does not form a stable duplex is said to be a “mismatch” under the specified hybridization conditions.

[0164] A list of probes may be assembled wherein each probe is a perfect match to the nucleic acid to be sequenced. The probes on this list may then be analyzed to order them in maximal overlap fashion. Such ordering may be accomplished by comparing a first probe to each of the other probes on the list to determine which probe has a 3′ end which has the longest sequence of bases identical to the sequence of bases at the 5′ end of a second probe. The first and second probes may then be overlapped, and the process may be repeated by comparing the 5′ end of the second probe to the 3′ end of all of the remaining probes and by comparing the 3′ end of the first probe with the 5′ end of all of the remaining probes. The process may be continued until there are no probes on the list which have not been overlapped with other probes. Alternatively, more than one probe may be selected from the list of positive probes, and more than one set of overlapped probes (“sequence nucleus”) may be generated in parallel. The list of probes for either such process of sequence assembly may be the list of all probes which are perfectly complementary to the nucleic acid to be sequenced or may be any subset thereof.
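
The maximal overlap ordering described above may be sketched as a greedy merge, in which the two fragments sharing the longest 3′-to-5′ overlap are repeatedly joined. This is a simplified illustration under stated assumptions: the function names are hypothetical, probe sequences are written for a single strand, and repeated K-1 sequences (which create the ambiguities discussed elsewhere herein) are not resolved.

```python
def overlap_length(left, right):
    """Length of the longest proper suffix of `left` equal to a prefix of `right`."""
    for size in range(min(len(left), len(right)) - 1, 0, -1):
        if left.endswith(right[:size]):
            return size
    return 0

def assemble(probes):
    """Greedily merge the pair with the longest overlap until no overlaps remain."""
    fragments = list(dict.fromkeys(probes))  # remove duplicates, keep order
    while len(fragments) > 1:
        best_size, best_i, best_j = 0, None, None
        for i, left in enumerate(fragments):
            for j, right in enumerate(fragments):
                if i != j:
                    size = overlap_length(left, right)
                    if size > best_size:
                        best_size, best_i, best_j = size, i, j
        if best_size == 0:
            break  # remaining fragments are separate sequence nuclei
        merged = fragments[best_i] + fragments[best_j][best_size:]
        fragments = [f for k, f in enumerate(fragments) if k not in (best_i, best_j)]
        fragments.append(merged)
    return fragments

# The five 4-mers of ACGTTGCA, supplied in arbitrary order.
print(assemble(["GTTG", "ACGT", "TGCA", "CGTT", "TTGC"]))  # ['ACGTTGCA']
```

When the probe list does not connect into a single sequence, the function returns several fragments, which correspond to separate sequence nuclei.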

[0165] The 5′ and 3′ ends of sequence nuclei may be overlapped to generate longer stretches of sequence. Where ambiguities arise in sequence assembly due to the availability of alternative proper overlaps with probes or sequence nuclei, hybridization with longer probes spanning the site of overlap alternatives, competitive hybridization, ligation of alternative end to end pairs of probes spanning the site of ambiguity or single pass gel analysis (to provide an unambiguous framework for sequence assembly) may be used.

[0166] By employing the above procedures, one may obtain any desired level of sequence, from a pattern of hybridization (which may be correlated with the identity of a nucleic acid sample to serve as a signature for identifying the nucleic acid sample) to overlapping or non-overlapping probes up through assembled sequence nuclei and on to complete sequence for an intermediate fragment or an entire source DNA molecule (e.g. a chromosome).

[0167] Sequencing may generally comprise the following steps:

[0168] (a) contacting an array of immobilized oligonucleotide probes with a nucleic acid fragment under conditions effective to allow a fragment with a sequence complementary to that of an immobilized probe to form a primary complex with the immobilized probe such that the fragment has a hybridized and a non-hybridized portion;

[0169] (b) contacting a primary complex with a set of labeled oligonucleotide probes in solution under conditions effective to allow a primary complex including an unhybridized sequence complementary to that of a labeled probe to hybridize to the labeled probe, thereby forming a secondary complex wherein the fragment is hybridized with both an immobilized probe and a labeled probe;

[0170] (c) removing from a secondary complex any labeled probe that has not hybridized adjacent to an immobilized probe;

[0171] (d) detecting the presence of adjacent labeled and unlabeled probes by detecting the presence of the label; and

[0172] (e) determining a nucleotide sequence of the fragment by connecting the known sequence of the immobilized and labeled probes.

[0173] In this embodiment of SBH, ligation may be implemented by a chemical ligating agent (e.g. water-soluble carbodiimide or cyanogen bromide). A ligase enzyme, such as the commercially available T4 DNA ligase from T4 bacteriophage, may be employed. The washing conditions which are selected to distinguish between adjacent versus nonadjacent labeled and immobilized probes are selected to make use of the difference in stability of continuously stacked or ligated adjacent probes.

[0174] Numerous modifications and variations in the practice of the invention are expected to occur to those skilled in the art upon consideration of the foregoing description of the presently preferred embodiments thereof. Consequently, the only limitations which should be placed upon the scope of the present invention are those which appear in the appended claims.

What is claimed is:
 1. A method of detecting a sequence of a target nucleic acid, comprising: (a) contacting a target nucleic acid with one or more mixtures of a plurality of oligonucleotide probe molecules of predetermined length and predetermined sequence, wherein each probe molecule comprises an information region and at least two probe molecules have different information regions, under conditions which produce, on average, more probe:target hybridization with probe molecules which are perfectly complementary to the target nucleic acid in the information region of the probe molecules than with probe molecules which are mismatched in the information region, wherein the target nucleic acid is not attached to a support, and wherein the probe molecules are not attached to a support; (b) detecting probe molecules that hybridize with the target nucleic acid, using a reader capable of detecting an individual probe molecule; and (c) detecting a sequence of the target nucleic acid by overlapping sequences of the information regions of at least two of the probe molecules contacted with the target in step (a).
 2. The method of claim 1 wherein at least two mixtures are contacted simultaneously.
 3. The method of claim 1 wherein at least two mixtures are contacted sequentially.
 4. The method of claim 1 wherein the mixture of probe molecules comprises at least about 10 probe molecules distinct in their information regions.
 5. The method of claim 1 wherein the mixture of probe molecules comprises at least about 100 probe molecules distinct in their information regions.
 6. The method of claim 1 wherein the mixture of probe molecules comprises at least about 1,000 probe molecules distinct in their information regions.
 7. The method of claim 1 wherein the mixture of probe molecules comprises at least about 10,000 probe molecules distinct in their information regions.
 8. The method of claim 1 wherein the probe molecules comprise modified bases.
 9. The method of claim 1 wherein multiple probe molecules are associated with identification tags.
 10. The method of claim 9 wherein multiple probe molecules each have two identification tags.
 11. The method of claim 9 wherein multiple probe molecules having the same information region are each associated with the same identification tag.
 12. The method of claim 9 wherein at least two probe molecules having different information regions are associated with different identification tags.
 13. The method of claim 9 wherein the probe molecules are divided into pools, wherein each pool comprises at least two probe molecules having different information regions, and all probe molecules within each pool are associated with the same identification tag which is unique to the pool.
 14. The method of claim 9 wherein at least one identification tag is a bar code.
 15. The method of claim 14 wherein the bar code is based on a property selected from the group consisting of size, shape, electrical properties, magnetic properties, optical properties, and chemical properties.
 16. The method of claim 14 wherein the identification tag is a DNA bar code comprising modified bases.
 17. The method of claim 14 wherein the identification tag is a molecular bar code.
 18. The method of claim 14 wherein the identification tag is a nanoparticle bar code.
 19. The method of claim 14 wherein the bar code comprises elements of varying length, each element comprising a preset number of unit tags.
 20. The method of claim 1 wherein the target nucleic acid is associated with a separator tag.
 21. The method of claim 1 wherein the probe molecules are associated with separator tags.
 22. The method of claim 1 wherein before detection step (b), probe molecules that hybridize to the target nucleic acid are separated from probe molecules that do not hybridize to the target nucleic acid.
 23. The method of claim 22 wherein probe molecules that do not hybridize to the target nucleic acid are eliminated by enzymatic digestion.
 24. The method of claim 1 wherein step (b) further comprises counting the number of times probe molecules having the same information region are detected.
 25. The method of claim 1 wherein a reader comprising a nanopore channel is used to detect probe molecules in step (b).
 26. The method of claim 25 wherein sensing of electrical responses within or around the nanopore channel is used to detect probe molecules in step (b).
 27. The method of claim 25 wherein the reader detects molecular bar codes in step (b).
 28. The method of claim 1 wherein the probe molecules are associated with one or more tags that allow identification of 5′/3′ orientation of probe molecules during detection step (b).
 29. The method of claim 1 wherein the sequence of the probe molecule(s) is detected in step (b).
 30. The method of claim 29 wherein at least two probe molecules are associated with identification tags and the identification tags are also detected in step (b).
 31. A method of sequencing a target nucleic acid, comprising: (a) contacting a target nucleic acid with one or more mixtures of a plurality of oligonucleotide probe molecules of predetermined length and predetermined sequence, wherein each probe molecule comprises an information region and at least two probe molecules have different information regions, under conditions which produce, on average, more probe:target hybridization with probe molecules which are perfectly complementary to the target nucleic acid in the information region of the probe molecules than with probe molecules which are mismatched in the information region, wherein the target nucleic acid is not attached to a support, and wherein the probe molecules are not attached to a support; (b) covalently joining probe molecules that form contiguous probe:target hybrids that are perfectly complementary to the target in the information region of the probe molecules; and (c) detecting covalently joined probe molecules, using a reader capable of detecting an individual probe molecule.
 32. The method of claim 31 further comprising the step of: (d) detecting a sequence of the target nucleic acid by overlapping at least two sequences generated by combining sequences of the information region of two probe molecules contacted with target nucleic acid in step (a).
 33. The method of claim 31 wherein before detection step (c), covalently joined probe molecules are separated from probe molecules that have not been covalently joined.
 34. The method of claim 1 or 18 wherein at least one nucleotide is added to the end of one or more probe molecules that hybridize to target nucleic acid using a polymerase or active fragment thereof.
 35. The method of claim 34 wherein the probe molecules are contacted with a mixture of four different uniquely labeled nucleotides.
 36. The method of claim 1 wherein target nucleic acids comprising an entire human genome are contacted with probe molecules.
 37. The method of any one of claims 1, 18 or 20 wherein a single nucleotide polymorphism is detected.
 38. A kit comprising a mixture of probe molecules, wherein about 100 or more probe molecules each have distinct information regions, wherein two or more of the sequences of said distinct information regions within the mixture overlap.
 39. The kit of claim 38 wherein about 10⁵ or less probe molecules each have the same information region.
 40. The kit of claim 38 wherein about 10⁴ or less probe molecules each have the same information region.
 41. The kit of claim 38 wherein each information region is represented by 10⁴ or more probe molecules having the same information region.
 42. The kit of claim 38 wherein at least two probe molecules having the same information region have the same identification tag.
 43. A kit comprising a set of mixtures of probe molecules, wherein about 100 or more probe molecules each have distinct information regions, wherein two or more of the sequences of said distinct information regions within the set overlap.
 44. The kit of claim 43 wherein about 10⁵ or less probe molecules each have the same information region.
 45. The kit of claim 43 wherein at least two probe molecules having different information regions are in the same pool and have the same identification tag.
 46. The kit of claim 38 or 43 wherein about 5000 or more probe molecules each have the same information region.
 47. A tag which is a bar code comprising an alternating arrangement of elements of varying detectable properties, wherein consecutive elements have a difference in at least one of their detectable properties.
 48. The tag of claim 47 wherein said elements comprise multiple unit tags of varying detectable properties and said elements vary in length.
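
By way of illustration only, and without limiting the claims, the bar code arrangement recited in claims 19, 47 and 48 may be modeled as a list of elements, each element being a run of identical unit tags. The sketch below is a hypothetical model and not the disclosed implementation: the function names `expand_bar_code` and `read_bar_code`, and the single-letter property labels (e.g. "H" and "L" for two distinguishable signal levels), are assumptions introduced for this example.

```python
from itertools import groupby


def expand_bar_code(elements):
    """Expand (property, length) elements into a flat string of unit tags.

    `elements` is a list of (property_label, unit_tag_count) pairs, where
    property_label is a hypothetical one-character label for a detectable
    property. Consecutive elements must differ in their property label.
    """
    for prop, length in elements:
        if len(prop) != 1 or length < 1:
            raise ValueError("each element needs a 1-char label and length >= 1")
    for (prev, _), (cur, _) in zip(elements, elements[1:]):
        if prev == cur:
            raise ValueError("consecutive elements must differ in a detectable property")
    return "".join(prop * length for prop, length in elements)


def read_bar_code(unit_tags):
    """Recover the (property, length) elements from a flat run of unit tags."""
    return [(prop, len(list(run))) for prop, run in groupby(unit_tags)]


code = [("H", 2), ("L", 1), ("H", 3)]
flat = expand_bar_code(code)
decoded = read_bar_code(flat)
```

Because consecutive elements always differ in the modeled property, the boundaries between elements can be recovered from the flat run of unit tags alone, and the element lengths carry the encoded information.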